Abstract 

We present Context Forest (ConF), a technique for 
predicting properties of the objects in an image based on 
its global appearance. Compared to standard nearest-neighbour 
techniques, ConF is more accurate, faster and 
more memory efficient. We train ConF to predict which aspects 
of an object class are likely to appear in a given image 
(e.g. which viewpoint). This makes it possible to speed up 
multi-component object detectors by automatically selecting the 
most relevant components to run on that image. This is 
particularly useful for detectors trained from large datasets, 
which typically need many components to fully absorb the 
data and reach their peak performance. ConF provides a 
speed-up of 2x for the DPM detector [1] and of 10x for the 
EE-SVM detector [2]. To show ConF's generality, we also 
train it to predict at which locations objects are likely to 
appear in an image. Incorporating this information in the 
detector score improves mAP by about 2% by 
removing false positive detections in unlikely locations. 

1. Introduction 

Global image appearance carries information about 
properties of objects in the image, such as their appearance 
and location. For instance, a picture of a highway taken 
from a car is more likely to contain cars from the back 
viewpoint than from the side (fig. 1). A picture of a racing track 
is more likely to contain racing cars than minivans. This 
also applies to other classes. A person in a road scene is 
more likely to be standing than sitting. Another property 
that can be inferred from global image appearance is the 
rough location of object instances [3]. For instance, an 
urban scene with cars parked in front of a building shows cars 
in the bottom half of the image (fig. 2). 

In this paper we exploit this observation for the benefit 
of object detection. We propose a method, coined Context 
Forest (ConF), for learning the relation between the global 
image appearance and the properties of the objects it 
contains. Given only the global appearance of a test image, 
ConF retrieves a subset of training images that contain 
objects with similar properties. ConF is based on the 
Random Forest [4, 5] framework, which provides high 
computational efficiency and the ability to learn complex, 
non-linear relations between global image appearance and 
object properties. It is very flexible and only requires these 
properties to be defined through a distance function between 
two object instances, e.g. their appearance similarity or 
difference in location. We demonstrate ConF by learning to 
predict two properties: aspects of object appearance and 
location. ConF trained to predict appearance is then used to 
speed up multi-component object detectors [1, 2], and ConF 
trained for object location is used to remove false positives. 

Multi-component detectors [1, 2] model the appearance 
variations within an object class as a mixture of several 
components. Each component is trained to recognize a 
particular aspect of object appearance. For example, cars 
could have viewpoint components [1], such as front and 
back views, or subclass components [6], such as taxi, 
ambulance and minivan. There is growing evidence [7] that 
the performance of object detectors tends to saturate as the 
amount of training data increases. We conduct an extensive 
experiment (sec. 4) on a dataset containing 15x as much 
training data as PASCAL VOC 2012 [8], using 
two popular multi-component detectors: the Deformable 
Part-based Model [1] (DPM) and the Ensemble of Exemplar 
SVMs [2] (EE-SVM). Our results show that as the size 
of the training set grows, these detectors can absorb the 
additional intra-class appearance variations and continue to 
improve their performance, but only if the number of 
components is increased accordingly. This extends the related 
findings of [7] on single linear SVM HOGs to the DPM and 
EE-SVM cases. Although increasing the number of 
components significantly improves performance, it also makes the 
detectors much slower, as all components need to be run on 
every test image. 

We use ConF to select the subset of model components 
which is most relevant to a particular test image. We then 
run only those components, obtaining a speed-up. Our 
experiments show that ConF delivers a 2x speed-up for 
DPM [1] and a 10x speed-up for EE-SVM [2] without loss 
of accuracy (sec. 5). Hence, ConF makes large 
multi-component detectors practical. This is particularly useful 
for EE-SVMs, as their number of components is very 
large (as many as there are training instances). 
Interestingly, in some cases we even gain a small improvement 
in accuracy, by not running some components that would 
produce false positive detections. 

Figure 1: Illustration of ConF selecting components for a test image (sec. 3.1). 

Moreover, we train a second ConF to predict at which 
positions and scales objects are likely to appear in a given 
test image, analogous to [3]. By incorporating this 
information in the detector score at test time, we reduce the 
false positive rate by removing detections in unlikely 
locations. Experiments show an mAP improvement of 2%. 
This demonstrates that ConF is a general technique that can 
predict various kinds of object properties. 

Finally, we carry out an extensive comparison to 
standard nearest-neighbour techniques for such context-based 
predictions [3, 9-12], which shows that ConF predicts 
object properties from global image appearance more 
accurately, and is much faster and more memory efficient (sec. 5). 

The rest of the paper is organized as follows. We start by 
reviewing related work in sec. 2. Sec. 3 explains ConF, our 
main contribution. In sec. 4 we present a first series of 
experiments, which study the behaviour of multi-component 
detectors on large training sets and thus motivate ConF for 
component selection. Finally, we present a second series of 
experiments to validate the benefits of ConF on two object 
detectors in sec. 5. 

2. Related work 

Context. The use of context for object detection is a broad 
research area. Some works [11, 13-15] model context as 
the interactions between multiple object classes in the same 
image. In this paper, we model context as a relation be¬ 
tween global image appearance and properties of the ob¬ 
jects within them, as in [3, 9, 10, 12]. These works have 
shown that global image descriptors give a valuable cue 
about which classes might be present in an image and where 
they are located. Since then, many object detectors [1, - 


19] employed such global context to re-score their detec¬ 
tions, thereby removing out-of-context false-positives. A 
similar approach was taken by [9, 17] for image parsing. 
All of these works have a nearest neighbour core: they first 
retrieve a small subset of training images which are most 
globally similar to a test image, and then transfer the rel¬ 
evant statistics of the object properties in this retrieval set 
to the test image. In our work instead the retrieval set is 
estimated by ConF, which is explicitly trained to return im¬ 
ages containing objects with similar properties to those in 
the test image. ConF has several advantages over nearest- 
neighbour approaches: (i) it can learn highly complex non¬ 
linear dependencies between the global descriptor and the 
object property. As a result, it estimates it more accurately; 
(ii) in large training sets, nearest neighbour becomes very 
slow, as its complexity is linear in their size. ConF is much 
faster and more memory efficient; (iii) ConF supports any 
objective function, which might even be evaluated on a dif¬ 
ferent data representation than the input at test time. This 
is a crucial feature for our problem, as we want to predict 
properties of objects, but based on global image features. 

Multi-component detectors. These detectors [1, 2, 6, 
20-2^ ] model each aspect of an object class as a separate 
component. They are very popular but can be slow when 
trained from large training sets, as they need many 
components to reach peak performance. While we present 
experiments on DPM [1] and EE-SVM [2], ConF can benefit all 
kinds of multi-component detectors, allowing them to 
use large training sets without compromising speed. 

EE-SVM. The EE-SVM [2] is an extreme case of a 
multi-component detector, where a separate component is created 
for each training example. Because of this, EE-SVMs benefit the 
most from dynamically selecting components with ConF. 
EE-SVMs are widely used for many applications beyond 
object detection [23-29]. As they explicitly associate a 
training example to an object in the test image, they enable 
transferring meta-data such as segmentation masks [2, 27], 
3D models [2], subcategory [29] and viewpoint [28]. They 
can also be used for discovering mid-level discriminative 
patches [23, 25, 30, 31] for scene classification [23, 30], for 
forming a dictionary of object parts [23, 25], or to 
characterize a certain geo-spatial area [31]. All these applications 
can potentially be sped up by ConF. 

Figure 2: Schematic illustration of ConF predicting object locations and their use to improve the DPM score (sec. 3.3). 

3. Estimating object properties from context 

In this section we exploit the observation that global im¬ 
age appearance contains information about properties of the 
objects inside it. We focus on two kinds of properties: as¬ 
pects of appearance and location in the image. We propose 
a new method, coined Context Forest (ConF), based on the 
Random Forest framework [4, 5], which learns the relation 
between global image features and the properties of the ob¬ 
ject in that image. Given only the global image appearance 
of a test image, ConF retrieves a subset of training images 
that contain objects with similar properties. 

3.1. Context Forest (ConF) 

Given a training set $T$, the goal of ConF is to map the 
global appearance $\phi(I_t)$ of a test image $I_t$ into a retrieval 
set $\mathcal{R} \subset T$. We want to construct a mapping such that the 
properties of objects (e.g. appearance, location) in images 
of $\mathcal{R}$ are similar to the properties of objects in $I_t$. We 
describe here our ConF technique in general, then specialize 
it to the object appearance property in sec. 3.2 and to the 
object location property in sec. 3.3. 

ConF at training time learns an ensemble of decision 
trees (forest) that operates on global image features $\phi(I)$. 
We construct each tree by recursively splitting the training 
set $T$ at each node. We want the leaves of the trees to 
contain images whose object properties are compact according 
to some measure $c(T_l)$, where $T_l$ are the training images in 
leaf $l$. Each internal node $n$ contains a binary split 
function $f(\phi(I), \theta_n)$, where $\theta_n$ are its parameters. Let $T_n$ be the 
training images that reached node $n$; then $f(\phi(I), \theta_n)$ 
splits $T_n$ into two subsets $T_l$ and $T_r$. We use axis-aligned 
weak learners as $f$ [5]: the split function $f(\phi(I), \theta_n)$ 
applies a threshold to one of the dimensions of the image feature 
vector $\phi(I)$. Following the extremely randomized forest 
approach [32], for each node we randomly sample several 
thousand possible splits $\theta$ and choose the one that maximizes 
the joint compactness: 

$$\theta_n = \arg\max_{\theta} \; c(T_l) + c(T_r) \qquad (1)$$

Compactness is defined by eq. (2), 
where $N$ is the number of ground-truth object bounding-boxes 
in set $T$ and $D(w_i, w_j)$ is a distance measure 
between the properties of two object bounding-boxes $w_i$ and 
$w_j$. Note how the inner summation in eq. (2) is an 
estimate of the density of the distribution induced by all 
bounding-boxes in $\{T \setminus w_i\}$, evaluated at $w_i$. This value 
is high if $w_i$ has other bounding-boxes nearby. The 
estimate is done with a Gaussian Kernel Density Estimator [33] 
(KDE). We learn the standard deviation $\sigma$ from the entire 
training set once, before we train the forest. This determines 
the scale of the problem, i.e. at which range of distances two 
bounding-boxes should be considered close. We compute $\sigma$ 
as follows. For each training bounding-box $w_i$ we compute 
its k-nearest neighbours in the whole training set and 
compute the standard deviation over them. Finally, we set $\sigma$ to 
the median of these standard deviations over all 
bounding-boxes. 
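The compactness measure and the estimate of $\sigma$ described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes compactness is the average, over bounding-boxes, of a leave-one-out Gaussian KDE evaluated at each box, that object properties are given as row vectors compared with the L2 distance, and that the "standard deviation over the k nearest neighbours" is the root-mean-square distance to them. The function names and the default `k=5` are hypothetical.

```python
import numpy as np

def pairwise_distances(props):
    """L2 distances between all rows of `props` (shape N x d)."""
    diff = props[:, None, :] - props[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def estimate_sigma(props, k=5):
    """Median over boxes of the RMS distance to the k nearest neighbours.

    Assumption: this is one reading of 'standard deviation over the
    k-nearest neighbours'; the value of k is hypothetical.
    """
    dist = pairwise_distances(props)
    np.fill_diagonal(dist, np.inf)        # a box is not its own neighbour
    knn = np.sort(dist, axis=1)[:, :k]    # distances to the k nearest boxes
    per_box_std = np.sqrt((knn ** 2).mean(axis=1))
    return float(np.median(per_box_std))

def compactness(props, sigma):
    """Average leave-one-out Gaussian KDE value over all boxes in a node."""
    n = len(props)
    if n < 2:
        return 0.0                        # assumption: undefined case scored as 0
    dist = pairwise_distances(props)
    kernel = np.exp(-0.5 * (dist / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(kernel, 0.0)         # leave-one-out: drop the box itself
    density = kernel.sum(axis=1) / (n - 1)
    return float(density.mean())
```

With this definition, a node whose boxes lie close together in property space scores higher than one whose boxes are spread out, which is the behaviour the split criterion relies on.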














By employing different compactness measures $c$ we can 
use ConF to learn relations between different object 
properties and global image features. Later we show how to 
use it for selecting components relevant for a test image 
(sec. 3.2) and for estimating likely object locations in a test 
image (sec. 3.3). 
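The node-splitting rule of eq. (1) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `best_axis_aligned_split`, the uniform sampling of thresholds between the minimum and maximum feature value, and the `compactness_fn` callback (which maps an array of image indices to a compactness value) are all assumptions.

```python
import numpy as np

def best_axis_aligned_split(features, compactness_fn, n_candidates=1000, rng=None):
    """Sample random (dimension, threshold) pairs and keep the one that
    maximizes c(T_l) + c(T_r).

    features:       array of shape (n_images, n_dims), the global descriptors.
    compactness_fn: callable taking an index array and returning a float.
    Returns (score, dim, threshold), or None if no valid split was found.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_dims = features.shape[1]
    best = None
    for _ in range(n_candidates):
        dim = int(rng.integers(n_dims))
        lo, hi = features[:, dim].min(), features[:, dim].max()
        if lo == hi:
            continue                      # constant dimension, cannot split
        thr = float(rng.uniform(lo, hi))
        left = np.flatnonzero(features[:, dim] < thr)
        right = np.flatnonzero(features[:, dim] >= thr)
        if len(left) == 0 or len(right) == 0:
            continue                      # degenerate split
        score = compactness_fn(left) + compactness_fn(right)
        if best is None or score > best[0]:
            best = (score, dim, thr)
    return best
```

Because the criterion is only ever evaluated, never differentiated, `compactness_fn` can be any function of the objects in the two child sets, which is what allows the split to act on image features while the objective is measured on object properties.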

ConF at test time operates in two phases (see fig. 1 and 
2). First, the test image $I_t$ is passed through the forest, reaching 
a leaf in each tree. Thereby, each tree selects the subset of 
training images contained in that leaf. We then accumulate 
these selections over all trees in the forest to form the score 
$\eta(I_i, I_t)$ for $I_i \in T$, which is the number of trees that have 
selected $I_i$. Second, we construct the retrieval set $\mathcal{R}$ by 
selecting the $k$ most frequently selected training images. In our 
experiments $k = 10$. 
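The two test-time phases can be sketched as follows. The tree representation used here (nested dictionaries with hypothetical keys `dim`, `thr`, `left`, `right`, and leaves stored as lists of training image ids) is an assumption made for illustration only.

```python
from collections import Counter

def leaf_images(tree, feature_vec):
    """Descend one tree and return the training image ids stored in the leaf.

    Assumption: internal nodes are dicts with keys 'dim', 'thr', 'left',
    'right'; leaves are plain lists of training image ids.
    """
    node = tree
    while isinstance(node, dict):
        go_left = feature_vec[node["dim"]] < node["thr"]
        node = node["left"] if go_left else node["right"]
    return node

def retrieval_set(forest, feature_vec, k=10):
    """Return the k training images selected by the largest number of trees."""
    votes = Counter()
    for tree in forest:
        votes.update(leaf_images(tree, feature_vec))  # one vote per tree and image
    return [image_id for image_id, _ in votes.most_common(k)]
```

The vote count of an image plays the role of the score $\eta(I_i, I_t)$; only threshold comparisons along one root-to-leaf path per tree are needed to compute it.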

Note how the split function $f$ and the compactness 
measure $c$ operate in structurally different spaces. While $f$ 
operates on the global image features $\phi(I)$, $c$ measures the 
similarity of properties of objects inside the images. In this 
fashion ConF learns the relation between the two. 
Importantly, $c$ is neither convex nor differentiable in $\phi(I)$ and we 
are only able to learn this inter-space relation thanks to the 
unique advantages offered by Random Forests. 


3.2. ConF for component selection 

Here we assume that each component of an object 
detector has been previously trained from a visually compact 
set of object instances, and that we know the component id 
$\ell_j$ of each training instance $j$. In EE-SVMs each training 
instance leads to a unique component. In DPM the 
component id of a training instance can be inferred from the output 
of the training procedure [34]. Based on this information, 
we train a ConF to select a small subset of components to 
run on a given test image $I_t$. 

Training. To train ConF for component selection we 
define the distance $D(w_i, w_j)$ in eq. (2) as the L2 distance 
between the HOG descriptors of object bounding-boxes $w_i$ 
and $w_j$. 

Test. We pass the test image $I_t$ through ConF, obtaining 
a retrieval set $\mathcal{R}$. We then estimate a posterior distribution 
$p(\ell_j | I_t)$ over detector components given the test image 
$I_t$. As a training image $I$ might contain multiple instances 
from different components, each training image is 'labelled' 
by a distribution over components $p(\ell_j | I)$. We estimate the 
component distribution for the test image $I_t$ as the average 
over the training images in the retrieval set $\mathcal{R}$: 

$$p(\ell_j | I_t) = \frac{1}{|\mathcal{R}|} \sum_{I \in \mathcal{R}} p(\ell_j | I) \qquad (3)$$

Based on this distribution, we can now select which 
components to run on $I_t$. We rank components by their 
probability and iteratively pick them until their combined 
probability mass exceeds a threshold $\gamma$. This threshold controls a 
trade-off between running few components and getting high 
detection performance. An interesting aspect of our 
formulation is that the number of selected components changes 
depending on the test image. A test image with a 
characteristic appearance, matching training images with a 
systematic recurrence of a few components, will lead to a peaky 
$p(\ell_j | I_t)$. In this case it is safe to run only a few components 
and we obtain a substantial speed-up. On the other hand, if 
ConF is uncertain about the contents of the test image, 
then the entropy of $p(\ell_j | I_t)$ will be high, and many 
components will be selected. In the extreme case, for a very 
difficult test image, our procedure naturally degenerates to 
the default case of running all components. 

3.3. ConF for object location 

At test time, a typical detector scores hundreds of 
thousands of windows over the whole test image $I_t$, based on 
their appearance only. We propose here to augment the 
detector's scores by adding knowledge about likely positions 
and scales of the object class, derived purely from the global 
appearance of $I_t$. 

Training. We train two ConFs to predict likely object 
positions and scales, respectively. To do so, we employ a 
different measure of compactness, substituting the distance 
function between two windows in eq. (2) with $D_{\text{pos}}$ (or 
$D_{\text{scale}}$). We define $D_{\text{pos}}(w_i, w_j)$ as the L2 distance 
between the centres of object bounding-boxes $w_i$ and $w_j$. We 
define $D_{\text{scale}}(w_i, w_j) = \max(W_i / W_j, W_j / W_i) \cdot \max(H_i / H_j, H_j / H_i)$ as 
the difference in their scale ($W$ and $H$ refer to width and 
height). 

Test. At test time, we first pass the test image $I_t$ through 
ConF, obtaining a retrieval set $\mathcal{R}$, and then compute the 
following score for each window $w$ in the test image: 

$$s(w) = \frac{1}{N} \sum_{w_i \in \mathcal{R}} \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{1}{2} \frac{D(w, w_i)^2}{\sigma^2} \right)$$

where $N$ is the number of object instances in the retrieval 
set $\mathcal{R}$, and $D$ is either $D_{\text{pos}}$ or $D_{\text{scale}}$. We learn $\sigma$ from 
the entire training set as in sec. 3.1. This score captures how 
likely a window is to cover an object based on its location 
or scale. Finally, we linearly combine the location and scale 
scores with the detector's score of a test window $w$. The usual 
non-maxima suppression stage follows. 

4. Multi-component detectors on large training sets 

In this section we study how multi-component detectors 
behave when trained on a large training set. We observe that 
it is necessary to increase the number of components as the 
size of the training set grows, so as to absorb the additional 
intra-class appearance variation. This motivates using ConF 
for component selection to speed up the resulting large 
mixture models at test time. We perform experiments (sec. 4.3) 
with two multi-component detectors (DPM [1] and 
EE-SVM [2], sec. 4.1) on a large-scale dataset (sec. 4.2). A 
related study was done by [7], using mainly a single-component 
HOG detector ([7], fig. 3-8). Their only experiment 
with a DPM [1] is on a small training set of 900 faces ([7], 
fig. 9-10). Hence, our study extends [7] to large-scale 
training of DPMs and EE-SVMs. 

                          Car                      Horse 
Dataset            +Imgs   Objs  -Imgs     +Imgs   Objs  -Imgs 
PASCAL 12 [8]       1161   2017   1161       482    710    482 
ImageNet [35]       6383   7120   6383      4550   6631   4550 
SUN2012 [36]         828   1779    828         -      -      - 
LabelMe [37]        6566  16743   6566         -      -      - 
UIUC [38]            828    889    500         -      -      - 
PASCAL-10x [7]         -      -      -      4065   6454   4065 
Total              15766  28548  15438     10107  13071  10107 
Our training set   14125  25774  13830      9097  12407   9097 
Our test set        1641   2774   1608      1010   1388   1010 

Table 1: Statistics of the large-scale dataset we assembled by 
combining images from existing source datasets. The column 
'+Imgs' reports the number of positive images per dataset; 'Objs' 
is the total number of instances of the class in all positive images; 
'-Imgs' is the number of negative images we sampled. 

4.1. Multi-component detectors 

DPM [1] represents an object class as a collection of parts 
arranged in a deformable configuration. DPMs use a 
mixture of components, each specialized to an aspect of the 
training data, to better capture the variation in appearance 
that the class exhibits. Each component is trained on a 
subset of the training data with compact appearance, e.g. 
different viewpoints [1] or subclasses [6]. We use the publicly 
available implementation [34]. 

EE-SVM [2] is a model composed of a separate linear 
SVM classifier for every training instance (exemplar). Each 
exemplar is represented by a rigid HOG template. The SVM 
is trained using the exemplar as the only positive against all 
negatives in the training set. We refer to a single exemplar 
SVM as a component of the EE-SVM model, by analogy 
with DPM components. At test time, each component is 
run on the image independently, producing many candidate 
detections. These are then filtered by non-maxima 
suppression in a final stage, where different components compete 
for the same image region. We use the publicly available 
implementation [39]. 

4.2. Dataset 

We assemble a large-scale dataset of 
two classes: car and horse (table 1). Below we discuss the 
car dataset in detail. The horse dataset was designed 
analogously. 

Source datasets. We combine 6 existing datasets: PASCAL 
VOC 2012 [8], ImageNet [35], LabelMe [37], SUN 
2012 [36], UIUC [38] and PASCAL-10x [7]. PASCAL 
VOC 2012, PASCAL-10x and ImageNet contain a 
variety of images, with both difficult, cluttered images and 
easier images with big centred cars. UIUC has low-resolution, 
gray-scale images of side-view cars. LabelMe and SUN 
2012 contain wide open street scenes with small cars. 

Positive images and ground-truth annotations. We 
collected all images with bounding-box annotations on cars. 
We took several steps to ensure a clean dataset. First, we 
removed duplicate images, of which there were a few hundred. Next, 
we removed incorrect bounding-boxes not covering cars or 
covering a car multiple times. Finally, as some images have 
unannotated cars, we annotated all missing instances with 
bounding-boxes for our entire test set (e.g. images from 
ImageNet). This enables reliable performance measurements. 

Negative images. We collect negative images from each 
dataset, so that it contributes an equal number of positive 
and negative images. To ensure variation, we randomly 
sampled these negative images. 

Train/test splits. We split the dataset into training (90%) 
and test (10%) sets. Then we split the training set further 
into 6 increasingly large subsets, which we use in sec. 4.3 
to study how the performance of the detectors evolves with 
increasing training data. We ensure that each split contains 
images from all source datasets in the same proportions 
to avoid dataset bias issues [40]. These proportions corre¬ 
spond to the percentage of images that each source dataset 
contributes to the entire dataset (e.g. 8 % of all car images 
are from PASCAL VOC 2012, and 40% from ImageNet). 

4.3. Experiments 

We evaluate the performance of DPM [1] and 
EE-SVM [2] as a function of the amount of training data and 
model capacity. First, we analyse how the performance on 
a fixed test set changes as the amount of training data 
increases. EE-SVM increases its capacity naturally with each 
positive example. DPM instead needs its capacity 
controlled manually. In the second experiment we increase the 
capacity of the DPM trained on the largest training set, 
expecting this to help it absorb the training data better. 

Increasing the amount of training data. For each class, 
we split the training set into 6 nested subsets, which contain 
1%, 5%, 10%, 25%, 50% and 100% of the training images 
(sec. 4.2). Each set is contained in all larger ones, and is 
composed of images from all source datasets in the same 
proportions. Fig. 3 shows the average precision (AP) and 
recall on our test set. 
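The construction of nested subsets with fixed per-source proportions can be sketched as follows. This is one simple way to satisfy both constraints (nesting and equal proportions), not necessarily the procedure used in the experiments: each source dataset is shuffled once and every subset takes a prefix of that fixed order. The function name, the rounding rule and the seed are assumptions.

```python
import random

def nested_subsets(images_by_source,
                   fractions=(0.01, 0.05, 0.10, 0.25, 0.50, 1.0),
                   seed=0):
    """Build nested training subsets with the same per-source proportions.

    images_by_source: dict mapping a source dataset name to a list of image ids.
    Returns one list of image ids per fraction, from smallest to largest.
    """
    rng = random.Random(seed)
    shuffled = {}
    for source, images in images_by_source.items():
        order = list(images)
        rng.shuffle(order)                # one fixed random order per source
        shuffled[source] = order
    subsets = []
    for frac in fractions:
        subset = []
        for source, order in shuffled.items():
            n = max(1, round(frac * len(order)))  # assumption: at least one image
            subset.extend(order[:n])      # a prefix, so smaller subsets are nested
        subsets.append(subset)
    return subsets
```

Taking prefixes of a single shuffle guarantees that each subset is contained in all larger ones, while applying the same fraction to every source keeps the dataset mix constant across subset sizes.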
Figure 3: Results of experiments when training DPMs and EE-SVMs with increasing amounts of training data. 


For DPM, we use 4 mixture components (2 and their 
mirrored versions). For both classes, increasing the amount of 
training data yields a modest improvement in AP. Initially, 
performance increases roughly logarithmically in the 
number of training images, but eventually saturates. 
Surprisingly, this happens quite early. A DPM car detector trained 
on the whole dataset only performs 1% better than when 
trained on a third of the data. We observe similar behaviour 
on horses. Recall, on the other hand, saturates almost 
immediately for cars and even decreases for horses as the training 
set grows. 

The EE-SVM detector demonstrates continuous, 
non-saturating growth in both AP and recall as the training data 
increases. The growth in recall is particularly strong, as it 
improves by more than 20% for both classes after seeing the 
full training set. 

Increasing DPM capacity. In fig. 4 we help DPM to 
better absorb the training data by progressively increasing its 
number of components, while keeping the training set fixed 
to the largest one. For both classes, performance increases 
steadily as the number of components grows from 4 (2+2 
mirrored) to 16 (8+8) for cars and 10 (5+5) for horses. After 
that the model starts to overfit: performance decreases and 
eventually (30 components) drops below the performance 
of the smallest model (4 components). For cars, each 
component clearly represents a different aspect (mostly 
viewpoint). 

This overfitting behaviour, even on such a large 
training set, is an interesting finding. The practical implication 
is that the DPM user has to be careful and manually 
control the capacity to obtain the best performance. In contrast, 
EE-SVM does not overfit even when training more than 20000 
components. Yet, while in terms of growth EE-SVM behaves 
very well, we note that its absolute detection performance 
is lower than that of DPM. 

4.4. Conclusions 

In general, both DPM and EE-SVM do benefit from large 
training sets. The key to continued growth for both 
methods is control of capacity. EE-SVM automatically increases 
its capacity, by adding a new component with each training 
sample. For DPM, the best results are achieved by 
selecting the right capacity manually, which also corresponds 
to a rather large number of components (compared to the 
2+2 traditionally used on PASCAL VOC [8]). Using such large 
models comes at the cost of longer runtime, as all 
components have to be applied to a test image. This is especially 
problematic for EE-SVMs, as running ten thousand 
components takes about 10 minutes for a single image. In the next 
section we demonstrate how ConF substantially reduces this 
computational burden by selecting only a small subset of 
components most relevant to a particular test image and 
running only those. 

Figure 4: Evolution of AP when increasing the number of DPM 
components. The performance increases steadily, but eventually 
the model starts overfitting. 

5. Experiments with ConF 

In this section we evaluate the performance of ConF for 
component selection and location estimation. As global 
image descriptors $\phi(I)$ we extract SURF [41], LAB and SIFT 
[42] descriptors on a dense grid at multiple scales. For each 
feature type we train a class-specific codebook of 1000 
visual words and construct a 2-level spatial pyramid [43]. 
Additionally, we also extract a GIST [44] descriptor for the 
image. Overall, we train ConF on a 16000-dimensional feature 
space, using 750 trees for each task. 















































Figure 5: Results of applying ConF for automatic component selection, for (a) DPM and (b) EE-SVM. The points on the plot correspond to different choices for the 
threshold $\gamma$ (sec. 3.2). The horizontal axis corresponds to the average number of components used. The vertical axis corresponds to the 
AP of the detector using components selected by ConF. 



Figure 6: Detections obtained before (top row) and after (bottom row) applying ConF for component selection. Green bounding-boxes 
highlight correct detections, while red ones show false positives. 


Quality of retrieval sets. We quantify how similar object 
bounding-boxes from a test image $I_t$ are to those in the 
retrieval set $\mathcal{R}$ returned by the method by the density of the 
retrieval set evaluated at the object properties in $I_t$, averaged 
over the $Z$ pairs of bounding-boxes in $I_t$ and $\mathcal{R}$. 
The distance $D$ and standard deviation $\sigma$ vary depending 
on the property (appearance, position, scale) as defined in 
sec. 3.2 and 3.3. 

Table 2 shows results averaged over the test set; higher 
values are better. As a baseline, we return the whole 
training set as the retrieval set. This leads to a generic prior 
on image properties, independent of the test image. 
Moreover, we compare to the traditional way of building retrieval 
sets by k-nearest neighbours (kNN) [3, 9, 10, 12, 27], 
defined on the same features as ConF. Both kNN and ConF 
greatly outperform the baseline, showing that they return 
meaningful retrieval sets. This confirms the observation that the 
global image appearance conveys information about the 
properties of the objects inside the images. Moreover, ConF returns 
better retrieval sets than kNN across all object properties 
and retrieval set sizes evaluated. 

Automatic component selection. We now use ConF to 
select object detector components relevant for a given test 
image. We use the best performing settings from sec. 4.3, 
i.e. a 16 (10) component DPM for cars (horses) and very 
large EE-SVMs using all training exemplars, i.e. 25774 for 
cars and 12407 for horses. 


Fig. 5 shows the evolution of AP while increasing the 
percentage of components used (higher is better). We 
compare to building retrieval sets by kNN, and to a baseline 
which randomly selects components without looking at the 
test image. ConF outperforms the baseline and kNN for 
both object classes, for both detection models, and over the 
whole range of the plots. By employing ConF, we closely 
match the performance of the full DPM model by running 
roughly half of the components. We match the performance 
of a full EE-SVM when running less than 10% of the 
components. Even in the extreme case of running just one 
EE-SVM component, the AP is about 90% of that of the full 
model. Interestingly, for EE-SVM on the horse class, ConF 
improves AP by 3% over the full ensemble using all 
components, when running 10x fewer components. There is also 
a minor improvement in AP for EE-SVM on the car class, 
when running half of the components. The AP improvement 
comes from dropping some components that lead to 
false positives. 

Figure 7: Detections obtained before (top row) and after (bottom row) applying ConF as location model. Green bounding-boxes highlight 
correct detections, while red ones show false positives. 

Table 2: Evaluation of the quality of retrieval sets for predicting object properties. Each entry represents the average density of the 
retrieval set evaluated at the object properties in the test images. We consider two sizes for the retrieval set: $|\mathcal{R}| = 1$ and $|\mathcal{R}| = 10$. 

Table 3: The results of augmenting the detector score with the 
location model derived by ConF (sec. 3.3) and by kNN, compared to not 
using a location model at all (None). 

These experiments demonstrate the ability of ConF to 
select relevant components given just global image 
appearance. This makes EE-SVMs practical even when trained 
from large sets with tens of thousands of exemplars (the 
average runtime for a test image decreases from 10 minutes to 
1 minute on our machine with a 4-core i5 3.10 GHz processor 
and 16 GB of memory). Fig. 6 shows some example results. 
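The threshold-based selection of sec. 3.2 can be sketched as follows. This is a minimal illustration under stated assumptions: each retrieved training image is represented by a row holding its distribution over components, the test distribution is their plain average, and components are taken in decreasing order of probability until the accumulated mass exceeds $\gamma$. The function name and the array layout are hypothetical.

```python
import numpy as np

def select_components(retrieved_distributions, gamma):
    """Select the components to run on a test image.

    retrieved_distributions: array of shape (k, n_components); row i is the
        component distribution of the i-th retrieved training image.
    gamma: probability-mass threshold in [0, 1].
    Returns (selected component indices, averaged distribution).
    """
    p = np.asarray(retrieved_distributions, dtype=float).mean(axis=0)
    order = np.argsort(-p, kind="stable")      # most probable component first
    mass = np.cumsum(p[order])
    # Index of the first position where the accumulated mass exceeds gamma.
    first_over = int(np.searchsorted(mass, gamma, side="right"))
    n_selected = min(first_over + 1, len(p))   # fall back to all components
    return order[:n_selected].tolist(), p
```

A peaked averaged distribution crosses the threshold after one or two components, whereas a flat one needs many, so the number of components run adapts to each test image. Lowering `gamma` trades detection performance for speed, which corresponds to moving along the curves of fig. 5.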

Object locations. Here we demonstrate how ConF 
trained to estimate the location of objects from global 
image features can improve detection performance by 
downgrading the score of false positives at unlikely locations. As 
table 3 shows, this improves AP for both classes and both 
detectors (+2% for cars and +1% for horses). In contrast, kNN 
does not bring any improvement, further confirming that 
ConF returns better retrieval sets. Fig. 7 shows example 
results. 
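The location re-scoring of sec. 3.3 can be sketched as follows. The sketch assumes that the context score of a window is a Gaussian kernel density over the object boxes of the retrieval set, that boxes are given as `(x1, y1, x2, y2)` tuples, and that the combination with the detector score is a weighted sum with a hypothetical `weight` parameter. Only the position distance is shown; a scale distance could be passed as `dist_fn` in the same way.

```python
import numpy as np

def d_pos(box_a, box_b):
    """L2 distance between the centres of two boxes (x1, y1, x2, y2)."""
    centre_a = np.array([(box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0])
    centre_b = np.array([(box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0])
    return float(np.linalg.norm(centre_a - centre_b))

def context_score(window, retrieved_boxes, dist_fn, sigma):
    """Gaussian kernel density of the retrieved boxes, evaluated at `window`."""
    dists = np.array([dist_fn(window, box) for box in retrieved_boxes], dtype=float)
    kernel = np.exp(-0.5 * (dists / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return float(kernel.mean())

def rescore(detector_score, location_score, weight=1.0):
    """Linear combination of the two scores; `weight` is a hypothetical parameter."""
    return detector_score + weight * location_score
```

A window far from every object in the retrieval set receives a context score close to zero, so after the combination it is ranked below windows with a similar appearance score in plausible locations.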

Computational and memory efficiency. ConF not 
only offers better performance than kNN, but is also more 
memory and computationally efficient. In terms of 
computation, kNN requires a number of distance computations linear 
in the number of training images, whereas ConF requires only 
a logarithmic number of threshold operations per tree. 

In terms of memory, kNN stores the feature vectors of all 
images in the training set. For cars, this amounts to 1.68 
GB. For each internal node ConF stores a threshold, a 
feature id and the ids of its children, amounting to 16 bytes. 
The leaves store the indices of the training images they 
contain, for a total of exactly 2 bytes times the number of training 
images per tree. For cars, there are fewer than 900 internal 
nodes on average per tree. As we store 750 trees per class, 
the grand total is only 27 MB (60x less than kNN). 
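The memory figures above can be checked with a short back-of-the-envelope computation. Two inputs are assumptions rather than statements from the text: that each feature value is stored in 8 bytes (double precision), and that the count of training images is the 14125 positive car training images of Table 1. Under these assumptions the kNN figure is reproduced, and using 900 internal nodes per tree gives an upper bound on the ConF storage.

```python
N_TRAIN_IMAGES = 14125        # positive car training images (Table 1); an assumption here
FEATURE_DIMS = 16000          # dimensionality of the global descriptor
BYTES_PER_VALUE = 8           # assumption: double-precision features

N_TREES = 750
MAX_INTERNAL_NODES = 900      # the text reports fewer than 900 on average
BYTES_PER_NODE = 16           # threshold, feature id and children ids
BYTES_PER_LEAF_INDEX = 2      # one stored index per training image per tree

# kNN keeps every training feature vector in memory.
knn_bytes = N_TRAIN_IMAGES * FEATURE_DIMS * BYTES_PER_VALUE

# ConF keeps, per tree, its internal nodes plus one index per training image.
bytes_per_tree = (MAX_INTERNAL_NODES * BYTES_PER_NODE
                  + N_TRAIN_IMAGES * BYTES_PER_LEAF_INDEX)
conf_bytes_upper = N_TREES * bytes_per_tree

print("kNN:  %.2f GiB" % (knn_bytes / 2 ** 30))
print("ConF: at most %.1f MiB" % (conf_bytes_upper / 2 ** 20))
```

This gives about 1.68 GiB for kNN and an upper bound of about 30.5 MiB for ConF, which is compatible with the reported 27 MB given that trees have fewer than 900 internal nodes on average.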

Conclusions. We propose a novel method, ConF, which 
learns the relation between the global image appearance and 
properties of objects in the image. We show how ConF can 
be employed to dynamically select a small number of 
components for a test image. This improves speed and, 
sometimes, even detection performance. We have also shown 
how to use ConF to predict likely object locations to reduce 
false positive rates. 